Quick Commands List

Legend: df = DataFrame, pd = pandas, s = Series

df = pd.read_csv("file.csv") <- Load a CSV file into a DataFrame

Viewing The DataFrame

df.describe() <- Summary statistics for the numeric columns (very useful)
df.columns <- Read Headers (names of each column)
^- Output: Index(['text'], dtype='object')

Quick Previews

df.head(3) <- First 3 rows (pass an integer, not a string)
df.tail(2) <- Last 2 rows
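A minimal sketch of loading and previewing, assuming a small made-up CSV (the rows are sample data; io.StringIO stands in for a real file on disk):

```python
import io
import pandas as pd

# Hypothetical CSV content; in practice this would be a path such as "file.csv".
csv_text = (
    "Name,Type 1,hp\n"
    "Bulbasaur,Grass,45\n"
    "Charmander,Fire,39\n"
    "Squirtle,Water,44\n"
    "Ivysaur,Grass,60\n"
)
df = pd.read_csv(io.StringIO(csv_text))

first_three = df.head(3)         # first 3 rows
last_two = df.tail(2)            # last 2 rows
column_names = list(df.columns)  # header names as a plain list
summary = df.describe()          # count, mean, std, min, quartiles, max (numeric columns only)

print(first_three)
print(summary)
```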

Headers

df['text'] <- Read every value in the named column
df[['text1', 'text2']] <- Read multiple columns (note the double brackets: a list of column names)
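A small sketch of the difference between single and double brackets, assuming a made-up DataFrame with columns text1, text2, and text3:

```python
import pandas as pd

# Hypothetical sample data for illustration.
df = pd.DataFrame({
    'text1': ['a', 'b'],
    'text2': ['c', 'd'],
    'text3': ['e', 'f'],
})

one_column = df['text1']              # single brackets with one name -> Series
two_columns = df[['text1', 'text2']]  # a list of names -> DataFrame

print(type(one_column).__name__)
print(type(two_columns).__name__)
```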

Rows

df.iloc[0] <- Read the first row (positions start at 0, so df.iloc[1] is the second row)
df.iloc[1, 2] <- Read the value at row position 1 ∩ column position 2
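A quick sketch of position-based access with iloc, assuming made-up sample rows:

```python
import pandas as pd

# Hypothetical sample data for illustration.
df = pd.DataFrame({
    'Name': ['Bulbasaur', 'Charmander', 'Squirtle'],
    'Type 1': ['Grass', 'Fire', 'Water'],
    'hp': [45, 39, 44],
})

first_row = df.iloc[0]   # position 0 is the first row
second_row = df.iloc[1]  # position 1 is the second row
cell = df.iloc[1, 2]     # row position 1, column position 2 (the 'hp' column)

print(first_row['Name'], second_row['Name'], cell)
```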

Iterating over each row

for index, row in df.iterrows():
    print(index, row['Name'])

df.loc[df['Type 1'] == "Grass"] <- Keep only the rows where 'Type 1' equals "Grass"
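Filtering and iterating can be combined; a minimal sketch, assuming made-up sample rows with the 'Name' and 'Type 1' columns used above:

```python
import pandas as pd

# Hypothetical sample data for illustration.
df = pd.DataFrame({
    'Name': ['Bulbasaur', 'Charmander', 'Squirtle', 'Ivysaur'],
    'Type 1': ['Grass', 'Fire', 'Water', 'Grass'],
})

# The comparison builds a True/False mask; loc keeps the rows where it is True.
grass = df.loc[df['Type 1'] == "Grass"]

# iterrows() yields (index label, row as a Series) pairs.
names = []
for index, row in grass.iterrows():
    names.append(row['Name'])
    print(index, row['Name'])
```

The original index labels are kept after filtering, so the printed indexes are 0 and 3 here, not 0 and 1.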

Sorting (descending order is supported)

df.sort_values('hp', ascending=False) <- Sort by one column, highest first
df.sort_values(['Type 1', 'hp']) <- Multiple columns are allowed
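A short sketch of both sorting forms, assuming made-up sample rows (hp values chosen without ties so the order is unambiguous):

```python
import pandas as pd

# Hypothetical sample data for illustration.
df = pd.DataFrame({
    'Name': ['Bulbasaur', 'Charmander', 'Squirtle', 'Ivysaur'],
    'Type 1': ['Grass', 'Fire', 'Water', 'Grass'],
    'hp': [45, 39, 44, 60],
})

# One column, descending.
by_hp_desc = df.sort_values('hp', ascending=False)

# Two columns: 'Type 1' ascending, then 'hp' descending within each type.
by_type_then_hp = df.sort_values(['Type 1', 'hp'], ascending=[True, False])

print(by_hp_desc['Name'].tolist())
print(by_type_then_hp['Name'].tolist())
```

sort_values() returns a new DataFrame; df itself keeps its original order.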

Data Frames
(looks like an Excel table)
We can think of a DataFrame as a combination of multiple Series that share one index

index=[] sets the row labels
columns=[] sets the column labels

import pandas as pd

certificates_earned = pd.DataFrame({
    'Certificates': [8, 2, 5, 6],
    'Time (in months)': [16, 5, 9, 12]
})
names = ['Tom', 'Kris', 'Ahmad', 'Beau']
certificates_earned.index = names
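The "combination of multiple Series" idea can be sketched by building the same table from two Series; a minimal example reusing the names and numbers above (the variable names certificates and months are just for illustration):

```python
import pandas as pd

names = ['Tom', 'Kris', 'Ahmad', 'Beau']

# Each column starts out as its own Series, labelled by the same index.
certificates = pd.Series([8, 2, 5, 6], index=names)
months = pd.Series([16, 5, 9, 12], index=names)

certificates_earned = pd.DataFrame({
    'Certificates': certificates,
    'Time (in months)': months,
})

# Pulling a column back out gives a Series again.
column = certificates_earned['Certificates']

# loc looks a row up by its index label.
tom = certificates_earned.loc['Tom']

print(certificates_earned)
```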

Data Cleaning

Four-step process of data cleaning:
- 1st step: Identify and handle missing data, which can be straightforward or complex depending on context.
- 2nd step: Address invalid values, which might involve removing or transforming them.
- 3rd step: Deal with domain-specific issues, such as values that are valid but unlikely in the given context.
- 4th step: Utilize functions in Pandas to work with missing values, using methods like isna(), notna(), and dropna().

Functions: pd.isna() pd.notna()
Methods: s.isna() s.notna() s.dropna()

Pandas Functions
isna(), notna(), and dropna() create new Series/DataFrames without modifying the original data.
isna() and notna() help identify missing values, while dropna() removes them.

The dropna() method can remove rows or columns with missing values: axis selects rows (0) or columns (1), and thresh sets the minimum number of non-missing values needed to keep a row or column.
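A minimal sketch of isna() and the dropna() options, assuming a made-up DataFrame with columns a, b, and c:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data for illustration.
df = pd.DataFrame({
    'a': [1.0, np.nan, 3.0],
    'b': [np.nan, np.nan, 6.0],
    'c': [7.0, 8.0, 9.0],
})

missing_mask = df.isna()                # True wherever a value is missing
missing_per_column = df.isna().sum()    # count of missing values per column

rows_complete = df.dropna()             # drop rows containing any missing value
cols_complete = df.dropna(axis=1)       # drop columns containing any missing value
rows_min_two = df.dropna(thresh=2)      # keep rows with at least 2 non-missing values

print(missing_per_column)
print(rows_min_two)
```

All of these return new objects; df still has its original 3 rows and 3 columns afterwards.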

pd.isnull(np.nan) <- True
The pandas isnull() function checks whether a value is null (it is an alias of isna())

np.nan - (Not A Number)

- comes from the NumPy library
- is a special floating-point value
- signifies missing or undefined data

Why would there be an "np.nan"?

Compatibility: np.nan is recognized by various libraries within the scientific Python ecosystem, including Pandas, SciPy, and scikit-learn. This makes it easier to work with missing data across different tools.
Represents Missing Data: Other values, such as strings or integers, may not be appropriate to represent missing data in a numerical context.
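A few properties of np.nan that explain why a dedicated check like isnull() exists; a small sketch with a made-up Series:

```python
import numpy as np
import pandas as pd

is_float = isinstance(np.nan, float)   # np.nan is an ordinary Python float
equal_to_itself = (np.nan == np.nan)   # NaN never compares equal, even to itself
detected = pd.isnull(np.nan)           # so use isnull()/isna() instead of ==

# A missing entry in a numeric Series is stored as NaN, which makes the dtype float64.
s = pd.Series([1, 2, None])
total = s.sum()                        # NaN is skipped by default

print(is_float, equal_to_itself, detected)
print(s.dtype, total)
```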

DataFrames can be inspected with the info() method and the shape attribute to understand their structure and spot missing values.
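A quick sketch of that inspection, assuming a made-up DataFrame with one missing Age:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data for illustration.
df = pd.DataFrame({
    'Name': ['Tom', 'Kris', 'Ahmad'],
    'Age': [30.0, np.nan, 41.0],
})

n_rows, n_cols = df.shape   # shape is an attribute: (rows, columns)
df.info()                   # info() is a method: prints dtypes and non-null counts

# Comparing non-null counts with the row count shows how much is missing.
missing_per_column = n_rows - df.count()
print(missing_per_column)
```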

Replacing Missing Values with Specific Values

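Syntax: DataFrame.fillna(value=None, method=None) where method is 'ffill' or 'bfill'
Note: recent pandas versions deprecate the method argument in favor of DataFrame.ffill() and DataFrame.bfill().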

The fillna() method fills missing (NaN, Not a Number) values in a DataFrame/Series with the values you specify.
It can also be combined with fill methods such as forward fill and backward fill.

By default, it returns a new DataFrame with filled values.

Method Values:
ffill - forward fill: carries the last valid value forward into the missing values that follow it in the same COLUMN
bfill - backward fill: carries the next valid value backward into the missing values that precede it in the same COLUMN
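A minimal sketch of the three fill styles on a made-up Series. It uses the ffill() and bfill() methods directly, which is an assumption about preferred style rather than the only way to do it:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: two missing values between 1.0 and 4.0.
s = pd.Series([1.0, np.nan, np.nan, 4.0])

filled_zero = s.fillna(0)  # replace every missing value with a specific value
forward = s.ffill()        # carry the last valid value forward
backward = s.bfill()       # carry the next valid value backward

print(filled_zero.tolist())
print(forward.tolist())
print(backward.tolist())
```

Each call returns a new Series; s still contains its two missing values.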

Categorical column cleaning involves using unique() or value_counts() to identify invalid values, followed by replacing or fixing them.
For more complex fixes, coding skills might be required, such as when handling ages with typographical errors.
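A small sketch of that workflow, assuming a made-up 'Type 1' column with two misspelled entries; the replacement mapping is an illustrative guess at what the correct values should be:

```python
import pandas as pd

# Hypothetical sample data: 'grass' and 'Gras' are invalid spellings of 'Grass'.
df = pd.DataFrame({
    'Type 1': ['Grass', 'Fire', 'grass', 'Gras', 'Water', 'Grass'],
})

# Step 1: look at the distinct values and how often each appears.
distinct = df['Type 1'].unique()
counts = df['Type 1'].value_counts()

# Step 2: map each invalid value to its fix.
cleaned = df['Type 1'].replace({'grass': 'Grass', 'Gras': 'Grass'})

print(counts)
print(cleaned.value_counts())
```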

Dealing with duplicates in a dataset

Duplicates are a common concern in data analysis.
Handling them requires defining what constitutes a duplicate value.
The DataFrame.duplicated() method in pandas helps identify duplicate values based on specified rules.
- the subset=[] parameter narrows the comparison to specific columns
- operates on rows in a DataFrame or elements in a Series

# Check for duplicated rows based on specific columns
duplicates_subset = df.duplicated(subset=['Name', 'Age'])

Returns BOOLEAN Values:

True: the element is the same as a previous element
False: the element is not a duplicate of any previous element

The DataFrame.drop_duplicates() method removes duplicate rows from a DataFrame based on the chosen criteria.

Keep parameter values:
keep='first' -> the first occurrence counts as the original (default)
keep='last' -> the last occurrence counts as the original
keep=False -> ALL occurrences count as duplicates
keep can be passed as a parameter to drop_duplicates() and duplicated()
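A minimal sketch of how subset and keep change the result, assuming a made-up DataFrame in which rows 0 and 2 match on both 'Name' and 'Age':

```python
import pandas as pd

# Hypothetical sample data for illustration.
df = pd.DataFrame({
    'Name': ['Tom', 'Kris', 'Tom', 'Tom'],
    'Age': [30, 25, 30, 41],
})
cols = ['Name', 'Age']

dup_first = df.duplicated(subset=cols)             # later repeats are marked True
dup_last = df.duplicated(subset=cols, keep='last')  # earlier repeats are marked True
dup_all = df.duplicated(subset=cols, keep=False)    # every member of a repeated group is True

deduped = df.drop_duplicates(subset=cols)             # keeps the first of each group
unique_only = df.drop_duplicates(subset=cols, keep=False)  # drops every repeated row

# Narrowing subset to 'Name' alone makes all later 'Tom' rows duplicates.
dup_name = df.duplicated(subset=['Name'])

print(dup_first.tolist(), dup_last.tolist(), dup_all.tolist())
```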

String handling methods in pandas

The "str" accessor on a pandas Series provides methods for string handling similar to Python's string methods.
String methods like split(), contains(), strip(), and replace() are available through the "str" accessor.
String handling methods often mirror equivalent methods in standard Python string manipulation.
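A short sketch of the four methods named above, assuming a made-up Series of names (the surnames and the underscore separator are invented for illustration):

```python
import pandas as pd

# Hypothetical sample data with stray spaces and an underscore separator.
s = pd.Series(['  Tom_Smith ', 'Kris_Lee', 'Ahmad_Khan  '])

stripped = s.str.strip()                                # remove leading/trailing whitespace
has_kris = stripped.str.contains('Kris')                # True where the substring appears
replaced = stripped.str.replace('_', ' ', regex=False)  # literal (non-regex) replacement
parts = stripped.str.split('_')                         # each element becomes a list
first_names = parts.str[0]                              # first item of each list

print(replaced.tolist())
print(first_names.tolist())
```

Each call applies the operation to every element, so no explicit loop is needed.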

Created: 2024-03-03